text field
BrightCookies at SemEval-2025 Task 9: Exploring Data Augmentation for Food Hazard Classification
Papadopoulou, Foteini, Mutlu, Osman, Özen, Neris, van der Velden, Bas H. M., Hendrickx, Iris, Hürriyetoğlu, Ali
This paper presents our system developed for the SemEval-2025 Task 9: The Food Hazard Detection Challenge. The shared task's objective is to evaluate explainable classification systems for classifying hazards and products in two levels of granularity from food recall incident reports. In this work, we propose text augmentation techniques as a way to improve poor performance on minority classes and compare their effect for each category on various transformer and machine learning models. We explore three word-level data augmentation techniques, namely synonym replacement, random word swapping, and contextual word insertion. The results show that transformer models tend to have a better overall performance. None of the three augmentation techniques consistently improved overall performance for classifying hazards and products. We observed a statistically significant improvement (P < 0.05) in the fine-grained categories when using the BERT model to compare the baseline with each augmented model. Compared to the baseline, the contextual words insertion augmentation improved the accuracy of predictions for the minority hazard classes by 6%. This suggests that targeted augmentation of minority classes can improve the performance of transformer models.
DocXPand-25k: a large and diverse benchmark dataset for identity documents analysis
Lerouge, Julien, Betmont, Guillaume, Bres, Thomas, Stepankevich, Evgeny, Bergès, Alexis
Identity document (ID) image analysis has become essential for many online services, like bank account opening or insurance subscription. In recent years, much research has been conducted on subjects like document localization, text recognition and fraud detection, to achieve a level of accuracy reliable enough to automatize identity verification. However, there are only a few available datasets to benchmark ID analysis methods, mainly because of privacy restrictions, security requirements and legal reasons. In this paper, we present the DocXPand-25k dataset, which consists of 24,994 richly labeled IDs images, generated using custom-made vectorial templates representing nine fictitious ID designs, including four identity cards, two residence permits and three passports designs. These synthetic IDs feature artificially generated personal information (names, dates, identifiers, faces, barcodes, ...), and present a rich diversity in the visual layouts and textual contents. We collected about 5.8k diverse backgrounds coming from real-world photos, scans and screenshots of IDs to guarantee the variety of the backgrounds. The software we wrote to generate these images has been published (https://github.com/QuickSign/docxpand/) under the terms of the MIT license, and our dataset has been published (https://github.com/QuickSign/docxpand/releases/tag/v1.0.0) under the terms of the CC-BY-NC-SA 4.0 License.
This stupid mistake in Logitech's AI-powered mouse is driving me mad
I'm trying to love Logitech's Signature AI Edition M750 Wireless Mouse. But I'm continually tripping over this small detail, and it literally is making me more frustrated than I should be. It's an intriguing, polarizing concept: By placing an "AI" button on the top of the mouse, Logitech gives users one-click access to ChatGPT to do one thing: simplify the writing and rewriting of text via an AI app called Logi Prompt Builder. It's a novel idea, and one you might expect Microsoft to develop, given the Copilot key that's already appearing on laptop keyboards. Instead, Logitech is leading the way.
ConFit: Improving Resume-Job Matching using Data Augmentation and Contrastive Learning
Yu, Xiao, Zhang, Jinzhong, Yu, Zhou
A reliable resume-job matching system helps a company find suitable candidates from a pool of resumes, and helps a job seeker find relevant jobs from a list of job posts. However, since job seekers apply only to a few jobs, interaction records in resume-job datasets are sparse. Different from many prior work that use complex modeling techniques, we tackle this sparsity problem using data augmentations and a simple contrastive learning approach. ConFit first creates an augmented resume-job dataset by paraphrasing specific sections in a resume or a job post. Then, ConFit uses contrastive learning to further increase training samples from $B$ pairs per batch to $O(B^2)$ per batch. We evaluate ConFit on two real-world datasets and find it outperforms prior methods (including BM25 and OpenAI text-ada-002) by up to 19% and 31% absolute in nDCG@10 for ranking jobs and ranking resumes, respectively.
RMDM: A Multilabel Fakenews Dataset for Vietnamese Evidence Verification
Nguyen, Hai-Long, Pham, Thi-Kieu-Trang, Le, Thai-Son, Nguyen, Tan-Minh, Vuong, Thi-Hai-Yen, Nguyen, Ha-Thanh
In this study, we present a novel and challenging multilabel Vietnamese dataset (RMDM) designed to assess the performance of large language models (LLMs), in verifying electronic information related to legal contexts, focusing on fake news as potential input for electronic evidence. The RMDM dataset comprises four labels: real, mis, dis, and mal, representing real information, misinformation, disinformation, and mal-information, respectively. By including these diverse labels, RMDM captures the complexities of differing fake news categories and offers insights into the abilities of different language models to handle various types of information that could be part of electronic evidence. The dataset consists of a total of 1,556 samples, with 389 samples for each label. Preliminary tests on the dataset using GPT-based and BERT-based models reveal variations in the models' performance across different labels, indicating that the dataset effectively challenges the ability of various language models to verify the authenticity of such information. Our findings suggest that verifying electronic information related to legal contexts, including fake news, remains a difficult problem for language models, warranting further attention from the research community to advance toward more reliable AI models for potential legal applications.
Benchmarking Multimodal AutoML for Tabular Data with Text Fields
Shi, Xingjian, Mueller, Jonas, Erickson, Nick, Li, Mu, Smola, Alexander J.
We consider the use of automated supervised learning systems for data tables that not only contain numeric/categorical columns, but one or more text fields as well. Here we assemble 18 multimodal data tables that each contain some text fields and stem from a real business application. Our publicly-available benchmark enables researchers to comprehensively evaluate their own methods for supervised learning with numeric, categorical, and text features. To ensure that any single modeling strategy which performs well over all 18 datasets will serve as a practical foundation for multimodal text/tabular AutoML, the diverse datasets in our benchmark vary greatly in: sample size, problem types (a mix of classification and regression tasks), number of features (with the number of text columns ranging from 1 to 28 between datasets), as well as how the predictive signal is decomposed between text vs. numeric/categorical features (and predictive interactions thereof). Over this benchmark, we evaluate various straightforward pipelines to model such data, including standard two-stage approaches where NLP is used to featurize the text such that AutoML for tabular data can then be applied. Compared with human data science teams, the fully automated methodology that performed best on our benchmark (stack ensembling a multimodal Transformer with various tree models) also manages to rank 1st place when fit to the raw text/tabular data in two MachineHack prediction competitions and 2nd place (out of 2380 teams) in Kaggle's Mercari Price Suggestion Challenge.
SILG: The Multi-environment Symbolic Interactive Language Grounding Benchmark
Zhong, Victor, Hanjie, Austin W., Wang, Sida I., Narasimhan, Karthik, Zettlemoyer, Luke
Existing work in language grounding typically study single environments. How do we build unified models that apply across multiple environments? We propose the multi-environment Symbolic Interactive Language Grounding benchmark (SILG), which unifies a collection of diverse grounded language learning environments under a common interface. SILG consists of grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack), as well as symbolic counterparts of visual worlds that require interpreting rich natural language with respect to complex scenes (ALFWorld, Touchdown). Together, these environments provide diverse grounding challenges in richness of observation space, action space, language specification, and plan complexity. In addition, we propose the first shared model architecture for RL on these environments, and evaluate recent advances such as egocentric local convolution, recurrent state-tracking, entity-centric attention, and pretrained LM using SILG. Our shared architecture achieves comparable performance to environment-specific architectures. Moreover, we find that many recent modelling advances do not result in significant gains on environments other than the one they were designed for. This highlights the need for a multi-environment benchmark. Finally, the best models significantly underperform humans on SILG, which suggests ample room for future work. We hope SILG enables the community to quickly identify new methodologies for language grounding that generalize to a diverse set of environments and their associated challenges.
Redefining Cancer Treatment- The Memorial Sloan Way
Whenever a patient has symptoms of cancer, the cancer tumour is taken out and sequenced. Genetic information in the tumor cell is stored in the form of DNA. It is then transcribed to form RNA which is then translated to form proteins/amino acids. In case of a mutation, or a mistake in DNA sequence, the resultant amino acid is affected giving rise to a variation for the particular gene. Thousands of genetic mutations may be present in the sequence. We need to distinguish the malignant mutations (drivers leading to tumour growth) from the benign (passenger) ones.